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Abstract 

Statistical inference of genetic regulatory networks is essential for un- 
derstanding temporal interactions of regulatory elements inside the cells. 
For inferences of large networks, identification of network structure is typ- 
ically achieved under the assumption of sparsity of the networks. How- 
ever, current approaches either have difficulty extending to networks with 
a large number of genes due to the computational constraints, or are 
difficult to interpret due to the use of module-based models. Also, in 
most previously proposed models, the choice of different parameters in 
the model is dealt with in a heuristic manner. For example, in LEARNe 
(|Nam et all\200i ). which used a system of differential equations to model 
the dynamics, an upper bound for the degree of connectivity of the net- 
work is assumed to be known. 

When the number of time points in the expression experiment is not 
too small, we propose to infer the parameters of the ordinary differential 
equations using the techniques from functional data analysis (FDA) by 
regarding the observed time course expression data as continuous-time 
curves. The derivative of the expression curve with respect to time is 
easily calculated without using finite difference and thus the problem of 
missing observations or unequally spaced time points can be dealt with 
in a consistent way in this model. For networks with a large number of 
genes, we take advantage of the sparsity of the networks by penalizing 
the linear coefficients with a Li norm. The model is fitted using the effi- 
cient algorithm that finds the whole regularization path of the parameters 
which in turn produces the whole spectrum of possible performances in 
terms of sensitivity and positive predictive value (PPV). The smoothing 
parameters can be chosen via cross-validation which gives a reasonable 
trade-off between PPV and sensitivity. The algorithm is compared to 
the state-of-the-art program LEARNe in simulations for small to medium 
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sized networks which shows the competitiveness of the new approach. The 
ability of the algorithm to infer network structure is demonstrated using 
the cell-cycle time course data for Saccharomyces cerevisiae. 



1 Introduction 

The increasing amount of high-throughput time course data has provided biolo- 
gists a window to the understanding of the biomolecular mechanism of different 
species. The expression of genes in these studies are indicative of the dynamic 
activities occurring inside the organism. Such regulatory activities involve com- 
plicated temporal interactions among different gene products, forming genetic 
networks indicating the causal relationships between different elements. It is the 
responsibilities of the statisticians to construct such networks using statistical 
models that uncovers such relationships. The utility of such models would be 
vital for the discovery of biological processes that are crucial for understanding 
of interactions of molecules involved in drug responses. 

Different models have been proposed for the construction and analysis of 
such networks from time course data, inclu ding stocha s tic di fferential equations 



(jChen et all l2005f ) and graphical models (jFriedmanl . l2004f) . Bayesian meth- 
ods can be applied which explore hidden causal r elationships betw een different 
nodes. In particular, dynamic Bayesian networks (|Yu et has achieved 



great success. For optimization of the networks, a large sample size is required 
for accurate estimation of the structure and the amount of computations re- 
quired increases rapidly with the number of nodes, thus limiting the application 
to networks with a small number of genes. 

Due to the obvious connection of the problem to the traditional time series 
analysis, multivariate autoregressive model has been used to fit the expression 
data. The utility of this model is constrained by the length of the time course 
data which is typically on the order of tens of times points, while the number of 
genes is much larger. Classical maximum likelihood estimator cannot be applied 
when the number of genes is greater than the number of time points. 

To overcome the problem, several algorithms have been introduced, almost 
all of which applied so me dimensionality reduction techniques. Subset selection 



(jGardner et all 120031 ) proceeds by searching for a subset of nodes with a fixed 
size that minimizes the least mean square error and use this subset for model 
fitting and inference. This typically leads to poor generalization performance 
due to overfitting because the number of time points is small. In addition, one 
needs to choose the size of the subset for searching and least square errors for 
subsets with different sizes are not directly comparable to each other. This 
choice would depend on one's belief about the degree of connectivity of the 
networks, which is difficult to assess a priori in applications, and choosing a 
too small subset size obviously is disastrous to the performance. Alternatively, 
singular value decomposition (SVD), utilizing a parsimonious set of features 
underlying the data, can be used to reduce the dimension and infer the networks 



(|Bansal all 120061 ) . but simulations showed that its performance is sensitive 
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to the number of features selected, which is difficult to determine in practice. 
Based on similar principles, the state space approach uses low dimensional latent 
variables to reduce the number of parame ters to be est i mated , but the resulting 



latent variables are difficult to interpret (jHirose et aLl . 120081 ). The difficulty in 



interpreting the model also originates from the fact that the estimated networks 
do not have the sparsity property. Besides, it is difficult to apply the state space 
model to time course data with unequally spaced time points. 

To address th e previously mentioned overfitting effect of subset selection, 
(jNam et all 120071 ) proposed combining multiple models with least mean square 
error below a certain threshold. The motivation comes from the machine learn- 
ing literature where model averaging or multiple voting is often observed to im- 
prove generalization performance. It was shown that this approach significantly 
outperforms simple subset selection and is at least comparable and sometimes 
better than SVD even when the number of features in SVD is optimally chosen 
in different situations, except when the number of time points is extremely small 
(six or smaller). The resulting algorithm named LEARNe, however, shares one 
common disadvantage with subset selection: an exhaustive search over all pos- 
sible subsets with a fixed size is conducted to find the good performers, which 
is infeasible when the number of genes is large. Additionally, some arbitrary 
threshold should be used to find the final connectivity structure of the networks. 
Due to the heuristic nature of the algorithm, no statistical theory seems to exist 
for the choice of this threshold. 

All the above mentioned approaches directly used expression data at dis- 
crete time points for fitting the model, which might be undesirable when the 
noise level is high. This is especially true when we use the ordinary differential 
equations to model the networks which requires estimation of the derivatives. 
In both LEARNe and SVD algorithms, the derivatives are replaced by the dif- 
ference of expression level between consecutive time points, which might be a 
poor estimate of the derivative. 

In this article, we propose a novel algorithm, using a penalized form of 
functional regression, by modelling the time course expressions levels over a 
certain period as continuous curves after smoothing the expression data. The 
sparsity of the networks is enforced by introducing Li penalty on the constant 
coefficients of the differential equations using another smoothing parameter. By 
varying the smoothing parameter, we can trace out the whole performance curve 
of our algorithm, which can be done using the modification of the least angle 



regression (LARS) program (jEfron et ali . |200J). The sparsity of the network 



can be inferred based on the data using cross-validation (CV) if desired. Un- 
like LEARNe, the algorithm is very efficient in computation and can deal with 
graphs with several hundred nodes when implemented in R on a personal com- 
puter. Our simulation shows that the algorithm has comparable and sometimes 
better performance than LEARNe, while sparsity is obtained without choosing 
an arbitrary threshold for the coefficient matrix. 
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2 Methods 



As a first step, we need to convert the time course expression data to continuous 
curves. This problem falls into the rea lm of functional data analysis (FDA) as 
studied extensively in the monograph ( Ramsay and Silverman . 20051 ). We use 



gij,i = 1, . . . , n, J = 1, . . . , n.i to denote the expression level of gene i at time 
points tj with < ii < . . . < t„; < 1. These expression levels have possibly been 
preprocessed and log transformed which usually results in better fit. Separately 
for each gene i, we search for the smooth function gi{t) that minimizes the 
following functional, 



g^{tJ)f + Xl {9':{t)fdt. 



In the above, the first term enforces the closeness of the curve to the ob- 
served expression level at the discrete time points, the second term enforces 
the smoothness of the function gi with larger smoothing parameter Ai resulting 
in a smoother function. Note that with this approach, we don't need to assume 
equally spaced time points or identical time points for different genes. 

In practice, the above optimization is perform by assuming gi has an expan- 
sion in terms of a certain basis 

K 

3 = 1 

After plugging in the above expansion, we only need to solve the K dimensional 
parameters vector {oy}, which is an easy convex optimization problem. 

The perhaps most popular basis used in this context is the B-spline basis with 
order 4 (i.e. cubic splines). K can be chosen large enough while smoothness of 
the function is control by the smoothing parameter Ai . The automatic choice for 
Ai can be made using statistical methods like cross-validation. In our experience, 
we find that the result is quite robust to the choice of this parameter and we 
find it more convenient to use a fixed parameter in all experiments. 

In general, the dynamics of regulatory networks can be written nonparamet- 
rically as 

<7:W=/(5iW,---,5n(i)),te[0,l] 

for a network with n nodes. Inference of such general nonparametric model is 
difficult with limi ted amount of data. As a first order approximation, same as 



( Nam et aZ.1 . 120071 ). we model the regulatory networks using the system of linear 



ordinary differential equations 

n 

g[{t) = a, + Y^ /3y.9j (i), t S [0, 1], i = 1, . . . , n 

with f3ij representing the regulatory effects of gene j on gene i. The interpre- 
tation is that gene j activates gene i if Pij > and gene j depresses gene i if 

P^J < 0. 
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From the estimate of gi{t), the derivative can be easily evaluated by g'^{t) 
aijb'j{t). The network coefficients a^, Pij can be fitted by minimizing 



Jo 



Unlike the discrete least squares, even when the original time points is smaller 
than the number of genes, there usually exists a unique minimizer of the above 
problem. However, overfitting still occurs when the number of genes is large. 
From biological considerations, the network is usually sparse with the evolution 
of the expression level of one gene only depending on the expression levels of a 
few other genes, which implies most of the interaction coefficients Pij are actually 
zero. The Li-norm penalty, also com monly called lasso penalty, is well-known to 
produce sparse regression coefficients ( Tibshirani . 1996 1). Despite its popularity. 



we are unaware of its previous application in functional data analysis. 
With the lasso penalty added, we will optimize 







where f3i — {Pn, . . . , Pim} is the network coefficients and A2 is a smoothing 
parameter with larger A2 producing sparser networks. By varying the smoothing 
parameter, we can produce a whole spectrum of networks with different degrees 
of sparsity. We approximate the integral jQiglit) — on — (3ijgj{t))'^dt by the 
discretized 

t=l j 

T is chosen to be large enough to approximate the integral well, and we find 
T = 20 is sufficient for our simulations. 

After discretization, the optimization problem becomes a standard linear re- 
gression with lasso pena lty, which can be solved after converting to a quadratic 
programming problem ([Tibshirani . Computation with different values 



of A2 makes this algorithm less efficient. Fortunately, there exists an algorithm 
that computes the whole regularization path for the coefficients for all values of 
the smoothing parameter A2 which makes our approach very efficient compu- 
tationally. This algorithm is a modification of least angle regressi on and takes 



advantag e of the fact that the solution path is piecewise linear (jEfron et al. 
[2004 : Rosset and Zhul . 120071) . 



If one desires to choose the parameter A2 based on the data, we can either 
use cross-validation within each gene which results in a different smoothing 
parameter for each gene, or we can use cross-validation on the whole dataset, 
which produces a single smoothing parameter for all the genes. Since the lasso 
penalty shrinks many networks coefficients to zero, we can infer the sparsity 
of the networks based on the data without using arbitrary thresholds for the 
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coefficients as is commonly done in previous approaclies when the structure of 
the network is unknown. 

Since we will compare our approach to LEARNe, we will briefly describe that 
algorithm here. In LEARNe, one performs a least square regression for each 
subset S of {1, . . . , n} of size k. That is, for each fixed gene i, one minimized 

rii-l 

j=i ses 

where Ag^ = 9i(j+i) ~ 9ij ^-i^^ = ^j+i ~ ^j- Thus LEARNe uses finite 
difference method to approximate the derivative of the expression level with 
respect to time. Each possible subset S corresponds to a different linear re- 
gression model. Instead of using one single best model, which will overfit the 
data with a small number of observations, one collects the top fi% models with 
the smallest sum of square above. This represents all the models that can fit 
the data reasonably well. Each model will vote independently on the signs of 
the network coefficients and the votes are collected into a n x {n + 1) matrix 
B, whose entries are integers with a large positive integer indicating a strong 
activating effect and large negative integer indicating a strong repressing effect. 
The final model is found with a thresholding procedure on O. If the coefficients 
are desired, it can be calculated by least square regression with the final mod el. 



No suggestion was provided for choosing the threshold in (|Nam et a/.l . 120071 ). 



The authors of ( Nam et'all . |2007( ) found by simulations that the result is 



robust to the choice of fj, and the method consistently outperformed subset 
selection using a single model with the smallest least square error. It is also 
better than SVD unless the number of time points is unreasonably small. The 
most serious drawback of LEARNe in our opinion is the computational burden 
when n is large. The author used fc = 4 in their simulation and this results in 
searching among (2) models. For example, when n = 100, it contains close to 
4,000,000 possible models! Empirically, even for n = 50 with A; = 2, we find our 
algorithm is much faster than LEARNe, although this might be attributed to 
our po or impl ementation of LEARNe. 



In (|Nam et al.. . ,2007). no discussion is offered on the choice of fc. This value 
obviously should depend on the sparsity of the networks. In practise, since the 
connectivity of the networks is unknown, it is difficult to choose appropriate fc 
especially considering the computational complexity that comes with large k. 



3 Results 

We compare the performance of our functional analytical approach with LEARNe 
in simulations. The networks are generated as follows. For an even number of 
genes n = 2r and m time points {1/m, 2/to, . . . , 1}, The nxn coefficient matrix 
A is generated as follows. 

^2i-l,2i-l — ^2i,2i = ai,^2i-l,2i — " ^2j,2j-l = ^i, 
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all other i , j 



Uniform{ 



2,0),b, - Uniform{-5,5) 



The structure of A has the form 



-bi 







ai 








0-2 






b2 




and the system of differential equations 



is written in matrix form 



G" 



AG 



with G{t) — {gi{t), . . . ,g„{t))'^. Note the coefficients ai do not appear in this 
simulation and also not used when fitting the model. 

We generate the initial expression level G(0) from Uniform distribution and 
solve the initial value differential equation problem using simple Euler method. 
The solution G is evaluated at those m time points and independent normal 
noise with variance ct^ is added at each time point. 

By the data generation mechanism, the evolution of one gene only depends 
on the expression level of itself as well as one other gene. The coefficient values 
tti are chosen to be negative so that the solution of the differential equations is 
asymptotically stable to avoid numerical problems. 

We used several combinations of parameters n, m, a for our simulation. For 
each combination, we use 50 randomly generated time course expression matrix 
and calculate the average performance over these data. In our algorithm, by 
vary A2 , we can reconstruct networks of varying degree of sparsity. By count- 
ing the number of connections in the reconstruction and the true network, the 
performance can be measured using the positive predictive value (PPV) versus 
sensitivity plots. 

The simulation results are shown in Figure [T] and Figure [21 These curves are 
produced by averaging over 50 randomly generated data sets with smoothing for 
visualization. That is, each curve is the result of using nonparametric smoothing 
over 50 PPV vs. sensitivity curves after applying a particular algorithm, either 
penalized functional approach or LEARNe. From those figures, it can be seen 
that for networks with 10 nodes, when the number of time points is small and 
with small noise level the performance of our algorithm is similar to LEARNe. 
But for simulation with a larger number of time points and larger noise level, the 
performance of our algorithm becomes better than LEARNe. This is possibly 
due to the fact that the finite different approximation to derivatives used in 
LEARNe comes into trouble in these situations. We also performed simulations 
with n = 30 and n 50 and observed similar effects. More importantly, we 
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(a) 




Figure 1: PPV vs. sensitivity plots for simulated networks with 10 nodes and 
(a) 10 (b) 30 (c) 50 time points. The standard deviation of the noise is (7=0.1 



(a) 




Figure 2: PPV vs. sensitivity plots for simulated networks with 10 nodes and 
(a) 10 (b) 30 (c) 50 time points. The standard deviation of the noise is (7=0.3. 



are able to run our algorithm on networks with n = 500 (taking about 20 
minutes) while it is impossible to run LEARNe with n bigger than 100 in our 
implementation. 

In the above simulations, we used k = 4 when applying LEARNe to the 
simulated data, where k is the subset size to search over in LEARNe. In these 
simulated data, it is known that the connection size is actually 2 for each gene. 
Curiously, as shown in Figure[3l using k = 2 results in much worse performances. 
As expected, if we use k = 1, the result is even worse. This shows that the 
performance of LEARNe depends critically on the size of subsets searched, which 
is in turn constrained by the computational resources available. 

We demonstrate the performance of the our penalized functional model with 
the application to the cel l cycle regulatory netw ork of Saccharomyces cerevisiae. 
The dataset comes from (jSpellman et al. I. ll998l) which provides a comprehensive 
list of cell cycle regulated genes identified by time course expression analysis. 
We use the 18 time points of the alpha factor synchronized expression data. This 
dataset has been used wid ely for evaluating a wide variety of statistical models. 
Same as Nam et "aLl ()2007l) . we consider 20 genes including 4 transcription factors 
known to be involved in regulatory functions during different stages of the cell 
cycle. 

We apply our approach to this dataset. The temporal evolution of these 20 
genes are shown in Figure [3] after B-spline smoothing of the expression data. 
To get a final model with data-based inference of network structure, we use the 
smoothing parameter selected by cross-validation with the same smoothing pa- 
rameter for all 20 genes. The final result with the interactions between each gene 
and four transcription factor is shown in Table[Tl and we compare the result with 



known interactions retrieved from the YEASTRACT database (Teixeira et al. 



20061) . For this submatrix, we get PPV=0.54 and sensitivity=0.80. Since all 



statistical models are merely mathematical approximations to the true world, 
it is plausible that automatically chosen model undersmoothes the coefficients 
matrix to provide a better fit to the data. One can also manually specify the 
smoothing parameter to achieved desired sparsity of the networks. 



4 Discussion 

We described a new algorithm for network construction using time course expres- 
sion data. The algorithm is based on functional data analysis with connection 
coefficients regularized by the lasso penalty. This is a very powerful approach 
that makes inference for large networks possible due to the efficient optimization 
procedures previously proposed. 

Our new algorithm provides several advantages over some previous approaches. 
First, it achieves noise reduction by regarding the expression data as continuous 
curves. High noise contained in the expression data usually disturbs inferences 
when one uses finite difference to approximate derivatives in the model. Sec- 
ond, the fitting procedure can be made fully automatic with little intervention 
from the user. The parameters in the model can be chosen using well studied 
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Figure 3: PPV vs. sensitivity plots for simulated networks with 10 nodes and 
10 time points. The standard deviation of the noise is cr=0.1. The algorithm 
LEARNe is applied with subset size k — 4, k — 2 and k = 1. 
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Figure 4: The expression levels of 20 genes from the cell cycle time course 
expression data, represented as continuous curves. 
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Table 1: The reconstructed network structure with PPV=0.54 and Sensitiv- 
ity=0.80. The interactions retrieved from database are denoted by '□' and the 
interactions inferred by the model are denoted by 'x'. 
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statistical techniques such as cross-validation. Third, the existing optimization 
procedure can efficiently infer the network structure with the whole regulariza- 
tion path for the coefficients simultaneously and thus makes inference of large 
networks feasible. 

It is well known that biological side information can reduce the number of 
false positives and false negatives. It would be interesting to take into account 
such information in future work. For example, prior knowledge on the interac- 
tions of genes can be incorporated into the smoothing parameter so that different 
interacti on coefficie nts can be penalized differently, resulting in adaptive lasso 
penalty ( Zou . 20061 ). We expect this strategy will achieve desired improvement 
on network prediction. 



Funding 

This research was funded by Singapore Ministry of Education Tier 1 SUG. 



12 



Acknowledgement 



The author thanks Dr. Dougu Nam for providing the MATLAB code for 
LEARNe. 



References 

Bansal, M., Delia Gatta, G., and di Bernardo, D. (2006). Inference of gene regulatory networks 
and compound mode of action from time course gene expression profiles. Bioinformatics , 22(7), 
815-822. 

Chen, K. C, Wang, T. Y., Tseng, H. H., Huang, C. Y. F., and Kao, C. Y. (2005). A stochastic 
differential equation model for quantifying transcriptional regulatory network in saccharomyces 
cerevisiac. Bioinforrnatics, 21(12), 2883-2890. 

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Annals of 
Statistics, 32(2), 407-451. 

Friedman, N. (2004). Inferring cellular networks using probabilistic graphical models. Science, 
303(5659), 799-805. 

Gardner, T. S., di Bernardo, D., Lorenz, D., and Collins, J. J. (2003). Inferring genetic networks 
and identifying compound mode of action via expression profiling. Science, 301(5629), 102-105. 

Hirose, O., Yoshida, R., Imoto, S., Yamaguchi, R., Higuchi, T., Charnock-Jones, D. S., Print, C, 
and Miyano, S. (2008). Statistical inference of transcriptional module-based gene networks from 
time course gene expression profiles by using state space models. Bioinformatics , 24(7), 932-942. 

Nam, D., Yoon, S. H., and Kim, J. F. (2007). Ensemble learning of genetic networks from time-series 
expression data. Bioinformatics, 23(23), 3225-3231. 

Ramsay, J. O. and Silverman, B. W. (2005). Functional data analysis. Springer series in statistics. 
Springer, New York, 2nd edition. 

Rosset, S. and Zhu, J. (2007). Piecewise linear regularized solution paths. Annals of Statistics, 
35(3), 1012-1030. 

Spcllman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., 
Botstein, D., and Futeher, B. (1998). Comprehensive identification of cell cycle-regulated genes 
of the yeast saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the 
Cell, 9(12), 3273-3297. 

Teixeira, M. C, Monteiro, P., Jain, P., Tenreiro, S., Fernandcs, A. R., Mira, N. P., Alenquer, M., 
Freitas, A. T., Oliveira, A. L., and Sa-Correia, I. (2006). The yeastract database: a tool for 
the analysis of transcription regulatory associations in saccharomyces cerevisiae. Nucleic Acids 
Research, 34, D446-D451. 

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal 

Statistical Society Series B-Methodologieal, 58(1), 267-288. 

Yu, J., Smith, V. A., Wang, P. P., Hartemink, A. J., and Jarvis, E. D. (2004). Advances to 
bayesian network inference for generating causal networks from observational biological data. 

Bioinformatics, 20(18), 3594-3603. 

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical 
Association, 101(476), 1418-1429. 



13 



